Educational Standards


MedBench-IT: A Comprehensive Benchmark for Evaluating Large Language Models on Italian Medical Entrance Examinations

Lazzaroni, Ruggero Marino, Angioi, Alessandro, Puliga, Michelangelo, Sanna, Davide, Marras, Roberto

arXiv.org Artificial Intelligence

Large language models (LLMs) show increasing potential in education, yet benchmarks for non-English languages in specialized domains remain scarce. We introduce MedBench-IT, the first comprehensive benchmark for evaluating LLMs on Italian medical university entrance examinations. Sourced from Edizioni Simone, a leading preparatory materials publisher, MedBench-IT comprises 17,410 expert-written multiple-choice questions across six subjects (Biology, Chemistry, Logic, General Culture, Mathematics, Physics) and three difficulty levels. We evaluated diverse models, including proprietary LLMs (GPT-4o, Claude series) and resource-efficient open-source alternatives (<30B parameters), focusing on practical deployability. Beyond accuracy, we conducted rigorous reproducibility tests (88.86% response consistency, varying by subject), ordering bias analysis (minimal impact), and reasoning prompt evaluation. We also examined correlations between question readability and model performance, finding a statistically significant but small inverse relationship. MedBench-IT provides a crucial resource for the Italian NLP community, EdTech developers, and practitioners, offering insights into current capabilities and a standardized evaluation methodology for this critical domain.
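The reproducibility and ordering-bias checks described in the abstract can be illustrated with a small sketch. This is not the MedBench-IT implementation — the function names and the stub model below are hypothetical — but it shows the two measurements in principle: response consistency is the fraction of questions answered identically across repeated runs, and an ordering-bias test re-shuffles the answer options while tracking where the correct option lands.

```python
import random

def shuffle_options(options, seed):
    """Shuffle answer options; return the new order and the new index
    of the correct answer (assumed to be options[0] originally)."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_correct = order.index(0)  # where the original first option ended up
    return shuffled, new_correct

def response_consistency(run1, run2):
    """Fraction of questions answered identically across two runs."""
    same = sum(a == b for a, b in zip(run1, run2))
    return same / len(run1)

# Hypothetical mock model: always picks the first option, so two runs
# agree perfectly and consistency is 1.0.
questions = [f"Q{i}" for i in range(5)]
answers_run1 = [0] * len(questions)
answers_run2 = [0] * len(questions)
print(response_consistency(answers_run1, answers_run2))  # → 1.0
```

With a real model, one would compare accuracy on the original versus shuffled option orders: a large gap would indicate positional bias, which the paper reports as minimal.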


Evaluating Multimodal Generative AI with Korean Educational Standards

Park, Sanghee, Kim, Geewook

arXiv.org Artificial Intelligence

This paper presents the Korean National Educational Test Benchmark (KoNET), a new benchmark designed to evaluate Multimodal Generative AI Systems using Korean national educational tests. KoNET comprises four exams: the Korean Elementary General Educational Development Test (KoEGED), Middle (KoMGED), High (KoHGED), and College Scholastic Ability Test (KoCSAT). These exams are renowned for their rigorous standards and diverse questions, facilitating a comprehensive analysis of AI performance across different educational levels. By focusing on Korean, KoNET provides insights into model performance in less-explored languages. We assess a range of models - open-source, open-access, and closed APIs - by examining difficulties, subject diversity, and human error rates. The code and dataset builder will be made fully open-sourced at https://github.com/naver-ai/KoNET.


Digitizing educational standards to make learning materials reusable across countries

#artificialintelligence

Consider a refugee population coming from country C residing in host country B, with limited or no access to education. The trauma of conflict and displacement, coupled with the difficulty of integration within the host country, puts refugee populations at a significant educational disadvantage, so it is worthwhile considering options that could "level the playing field" by providing improved access to education. There is hope that the vast amounts of Open Educational Resources (OER) freely available on the internet can play a role in this, particularly in combination with educational platforms like Kolibri. The Kolibri platform aims to provide access to learning opportunities for all, and it is particularly suited to the refugee context: the runs-anywhere design of the Kolibri applications allows them to be used in computer labs, in the classroom, on phones, and in informal learning centres. Our experience and work with partners like UNHCR have shown that in emergency and crisis contexts, a key bottleneck is the lack of sufficient educational content aligned to the learning goals of the project.